# CheckpointEngine

CheckpointEngine is a component that synchronizes model weights between trainer and inference processes. It is used primarily in RLHF training to synchronize weights between Actor models and Rollout samplers.

## Basic Interface

```python
from abc import ABC, abstractmethod
from typing import Any, AsyncGenerator


class CheckpointEngine(ABC):
    """Checkpoint engine base class

    The checkpoint engine handles weight synchronization
    between trainer and inference processes.
    """

    @abstractmethod
    def prepare(self) -> dict[str, Any]:
        """Prepare for weight synchronization"""
        ...

    @abstractmethod
    def init_process_group(self, rank: int, world_size: int, **kwargs):
        """Initialize process group"""
        ...

    @abstractmethod
    async def send_weights(self, weight_generator):
        """Send weights (called in trainer process)"""
        ...

    @abstractmethod
    def receive_weights(self) -> AsyncGenerator:
        """Receive weights (called in inference process)"""
        ...

    @abstractmethod
    def finalize(self):
        """Clean up resources"""
        ...
```

## Available Checkpoint Engines

Twinkle provides two checkpoint engine implementations:

### NCCLCheckpointEngine

A checkpoint engine that uses NCCL for high-speed weight transfer between GPUs.

- High-Speed Transfer: Uses NCCL for high-speed point-to-point transfer between GPUs
- Zero-Copy: Transfers directly between GPU memories without going through the CPU
- Bucketed Transfer: Supports bucketed transfer for large models

See: [NCCLCheckpointEngine](NCCLCheckpointEngine.md)

### HCCLCheckpointEngine

A checkpoint engine that uses HCCL for weight transfer between Ascend NPUs.

- NPU Optimized: Weight transfer optimized specifically for Ascend NPUs
- Efficient Communication: Uses HCCL for high-speed communication between NPUs
- Compatible Interface: Keeps the same interface as NCCLCheckpointEngine

See: [HCCLCheckpointEngine](HCCLCheckpointEngine.md)

## How to Choose

- **NCCLCheckpointEngine**: Suitable for GPU environments, where it provides the highest transfer performance
- **HCCLCheckpointEngine**: Suitable for Ascend NPU environments

> The checkpoint engine is a key component of RLHF training infrastructure, ensuring that trainers and samplers use consistent model weights.
>
> Currently, synchronization works in one of two ways depending on `merge_and_sync`. When it is `True`, the LoRA is merged into the base model and the merged weights are synchronized. When it is `False`, only the LoRA weights are synchronized. In multi-tenant scenarios, LoRA files are attached directly to vLLM.
>
> When `merge_and_sync=False`, or in multi-tenant mode, vLLM must be started with `enable_lora=True`. When `merge_and_sync=True`, or when training full parameters, `enable_lora` should be set to `False`.
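
The `enable_lora` rule in the note above can be condensed into a small helper. This is a minimal sketch and not part of Twinkle: the function name `resolve_enable_lora` and the `multi_tenant` and `full_parameter` flags are hypothetical, and the precedence among them (multi-tenant first, then full-parameter) is an assumption, since the note does not say how the conditions combine.

```python
def resolve_enable_lora(
    merge_and_sync: bool,
    multi_tenant: bool = False,
    full_parameter: bool = False,
) -> bool:
    """Return the value to use for vLLM's enable_lora startup parameter.

    Hypothetical helper: the flag names and their precedence are
    assumptions made for illustration only.
    """
    if multi_tenant:
        # LoRA files are attached directly to vLLM, so LoRA support is required.
        return True
    if full_parameter:
        # Full-parameter training has no separate LoRA weights to load.
        return False
    # merge_and_sync=True  -> merged weights are synced, no LoRA in vLLM.
    # merge_and_sync=False -> only LoRA weights are synced, vLLM needs LoRA.
    return not merge_and_sync


if __name__ == "__main__":
    for merge in (True, False):
        print(f"merge_and_sync={merge} -> enable_lora={resolve_enable_lora(merge)}")
```

Deriving the flag from the synchronization mode in one place, rather than setting both by hand, may help keep the trainer and the vLLM startup configuration from drifting apart.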